Up to this point, the discussion has been geared toward
convincing you that queues are the next best thing to sliced bread (and
YouTube). Now, let’s look at what Windows Azure queues are, what they
provide, and how to use them.
1. Architecture and Data Model
The Windows Azure queue storage service is composed of two main
resources in its data model:
Queues
A queue contains many messages, and a Windows Azure storage
account can contain any number of queues. There is no limit to the
number of messages that can be stored in any individual queue.
However, there is a time limit: messages can be stored for only a
week before they are deleted. Windows Azure queues aren’t meant
for long-lived messages, though, so any message that sticks around
for such a long period of time is probably a bug.
Message
A message is a small piece of data up to 8 KB in size.
It is added to the queue using a REST API and delivered to a
receiver. Note that, although you can store data in any format,
the receiver will see it as base64-encoded data. Every message
also has some special properties, as shown in Table 1.
Table 1. Message properties
| Name | Description |
|---|---|
| MessageId | This is a GUID that uniquely identifies the message within the queue. |
| VisibilityTimeout | This property specifies the exact opposite of what its name suggests. This value determines for how long a message will be invisible (that is, of course, not visible) after it’s retrieved from the queue. (You’ll learn how to use this to protect against crashing servers later in this chapter.) After this time has elapsed, if the message hasn’t been deleted, it’ll show up in the queue again for any consumer. By default, this is 30 seconds. |
| PopReceipt | The server gives the receiver a unique PopReceipt when a message is retrieved. This must be used in conjunction with a MessageId to permanently delete a message. |
| MessageTTL | This specifies the Time to Live (TTL) in seconds for a message. Note that the default and the maximum are the same: seven days. If the message hasn’t been deleted from the queue in seven days, the system will lazily garbage-collect and delete it. |
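To make the base64 remark in the Message description concrete, here is a minimal Python sketch of what a sender and receiver might do with a message body. The helper names `encode_body` and `decode_body` are hypothetical, and checking the 8 KB limit against the raw (pre-encoding) bytes is an assumption of this sketch rather than something the text specifies.

```python
import base64

MAX_MESSAGE_BYTES = 8 * 1024  # 8 KB limit mentioned in the text


def encode_body(text):
    """Hypothetical helper: turn a string into the base64 form a receiver would see."""
    raw = text.encode("utf-8")
    # Assumption: the limit is checked against the raw bytes before encoding.
    if len(raw) > MAX_MESSAGE_BYTES:
        raise ValueError("message body exceeds 8 KB")
    return base64.b64encode(raw).decode("ascii")


def decode_body(encoded):
    """Hypothetical helper: recover the original string on the receiving side."""
    return base64.b64decode(encoded).decode("utf-8")


wire_form = encode_body("hello")
round_trip = decode_body(wire_form)
```

Here `wire_form` holds the base64 text and `round_trip` is the original string again.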
2. The Life of a Message
The life of a message is a short (but exciting and fruitful) one.
Windows Azure queues deliver at-least-once semantics. In other words,
Windows Azure queues try really, really hard to ensure that someone
reads and finishes processing a message at least once. However, they
don’t guarantee that someone won’t see a message more than once
(no “at-most-once” semantics), nor that the
messages will appear in order.
Figure 1 shows the life of a
message.
The typical flow is something like the following:
1. A producer adds a message to the queue, with an optional TTL.
2. The queue system waits for a consumer to take the message off
   the queue. Regardless of what happens, if the message is on the
   queue for longer than the TTL, the message gets deleted.
3. A consumer takes the message from the queue and starts
   processing it. When the consumer takes the message off the queue,
   Windows Azure queues make the message invisible. Note that the
   message is not deleted. Windows Azure just flips a bit somewhere
   to note that it shouldn’t return the message to consumers for the
   time being. The consumer is given a PopReceipt, and a new one is
   issued every time a consumer takes a message off the queue. If a
   consumer takes the same message multiple times, it’ll get a
   different PopReceipt every time. This step is where things get
   interesting. Two scenarios can play out here:
   a. In the first scenario, the consumer finishes processing
      the message successfully. The consumer can then tell the queue
      to delete the message using the PopReceipt and MessageId. This
      is basically you telling the queue, “Hey, I’m done processing.
      Nuke this message.”
   b. In the second scenario, the consumer crashes or loses
      connectivity while processing the message. As noted earlier,
      this can happen often in distributed services. You don’t want
      queue messages to go unprocessed. This is where the invisibility
      and the VisibilityTimeout kick in. Windows Azure queues wait the
      number of seconds specified by VisibilityTimeout, and then say,
      “Hmm, this message hasn’t been deleted yet. The consumer probably
      crashed, so I’m going to make this message visible again.” At
      this point, the message reappears on the queue, ready to be
      processed by another consumer or the same consumer. Note that
      the original crashing consumer could come back online and delete
      the message. Windows Azure queues are smart enough to reconcile
      both of these events and delete the message from the queue.
Picking the right VisibilityTimeout value depends on your
application. Pick a number that is too small and the message could show
up on the queue again before your consumer has had a chance to finish
processing it. Pick a timeout that is too large and, after a failure,
the work item sits invisible for a long time before another consumer
can retry it. This is one area where you should experiment to see what
number works for you.
In the real world, step 3b will see a different consumer pick up
and process the message to completion, while the first crashing consumer
is resurrected. Using this two-phase model to delete messages, Windows
Azure queues ensure that every message gets processed at least
once.
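The two-phase get/delete flow described above can be illustrated with a small in-memory simulation. This is only a sketch of the idea, not the Windows Azure API: the class `SimulatedQueue`, its method names, and the explicit `now` parameter (used instead of a real clock so the example is deterministic) are all hypothetical. As a simplification, the sketch rejects a delete whose receipt has been superseded by a later retrieval.

```python
import uuid


class SimulatedQueue:
    """Toy in-memory model of visibility timeouts (not the real service)."""

    def __init__(self, visibility_timeout=30, ttl=7 * 24 * 3600):
        self.visibility_timeout = visibility_timeout
        self.ttl = ttl
        self._messages = {}  # message_id -> message state

    def put(self, body, now):
        message_id = str(uuid.uuid4())
        self._messages[message_id] = {
            "body": body,
            "expires_at": now + self.ttl,  # TTL deadline
            "visible_at": now,             # visible immediately
            "pop_receipt": None,
        }
        return message_id

    def get(self, now):
        """Return (message_id, pop_receipt, body) or None if nothing is visible."""
        self._drop_expired(now)
        for message_id, msg in self._messages.items():
            if msg["visible_at"] <= now:
                # Hide the message instead of deleting it, and issue a fresh receipt.
                msg["visible_at"] = now + self.visibility_timeout
                msg["pop_receipt"] = str(uuid.uuid4())
                return message_id, msg["pop_receipt"], msg["body"]
        return None

    def delete(self, message_id, pop_receipt):
        """Second phase: permanently remove the message if the receipt matches."""
        msg = self._messages.get(message_id)
        if msg is None or msg["pop_receipt"] != pop_receipt:
            return False
        del self._messages[message_id]
        return True

    def _drop_expired(self, now):
        expired = [m for m, msg in self._messages.items() if msg["expires_at"] <= now]
        for message_id in expired:
            del self._messages[message_id]


q = SimulatedQueue(visibility_timeout=30)
q.put("transcode video 42", now=0)
first = q.get(now=0)     # consumer A takes the message, then crashes
hidden = q.get(now=10)   # nothing visible while the timeout is running
second = q.get(now=31)   # timeout elapsed: consumer B gets the same message
deleted = q.delete(second[0], second[1])  # consumer B finishes and deletes it
```

In this run `hidden` is `None`, `second` carries the same message ID as `first` with a different receipt, and the final delete succeeds.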
Note:
One interesting issue that occurs when messages get redelivered
on crashing receivers has to do with poison
messages. Imagine a message that maliciously or
nonmaliciously causes a bug in your code, and causes a crash. Since
the message won’t be deleted, it’ll show up in the queue again, and
cause another crash…and another crash…and over and over. Since it
stays invisible for a short period of time, this effect can go
unnoticed for a long period of time, and cause significant
availability issues for your service. Protecting against poison
messages is simple: get the security basics right, and ensure that
your worker process is resilient to bad input.
Poison messages will eventually leave your system when their TTL
is over. This could be an argument for making your TTLs shorter to
reduce the impact of bad messages. Of course, you’ll have to weigh
that against the risk of losing messages if your receivers don’t
process messages quickly enough.
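As a rough illustration of a worker that is resilient to bad input, the following sketch validates a message body before acting on it and reports failure instead of crashing. The JSON payload shape, the `video_id` field, and the function name `handle_message` are assumptions made up for this example.

```python
import json


def handle_message(raw_body):
    """Return (ok, detail). Malformed input is reported rather than raised."""
    try:
        payload = json.loads(raw_body)
        video_id = int(payload["video_id"])
    except (ValueError, KeyError, TypeError) as exc:
        # Bad input: the caller can log this and delete the message
        # so it does not keep reappearing and crashing workers.
        return False, f"rejected bad message: {exc!r}"
    return True, f"transcoding video {video_id}"


good = handle_message('{"video_id": 7}')
bad_json = handle_message("not json at all")
wrong_shape = handle_message("[1, 2, 3]")
```

What to do with a rejected message (delete it, or park it somewhere for inspection) is a design choice for your application; the point is only that the worker survives it.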
3. Queue Usage Considerations
Windows Azure queues trip people up because they expect the service to
be just like MSMQ, SQL Service Broker, or
<insert-any-common-messaging-system>—and it
isn’t. You should be aware of some common “gotchas” when using Windows
Azure queues. Note that these really aren’t defects in the
system—they’re part of the package when dealing with highly scalable and
reliable distributed services. Some things just work differently in the
cloud.
3.1. Messages can be repeated (idempotency)
It is important that your code be idempotent when it comes to processing queue
messages. In other words, your code should be able to receive the same
message multiple times, and the result shouldn’t be any different.
There are several ways to accomplish this.
One way is to just do the work over and over again—transcoding
the same video a few times doesn’t really matter in the big picture.
In other cases, you may not want to process the same transaction
repeatedly (for example, anything to do with financial transactions).
Here, the right thing to do is to keep some state somewhere to
indicate that the operation has been completed, and to check that
state before performing that operation again. For example, if you’re
processing a payment, check whether that specific credit card
transaction has already happened.
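A minimal sketch of the check-state-first approach might look like the following. The `PaymentProcessor` class and the in-memory set are hypothetical stand-ins; a real system would keep this state in durable storage and would need to record it atomically with the charge itself.

```python
class PaymentProcessor:
    """Toy idempotent handler: each transaction ID is charged at most once."""

    def __init__(self):
        self.completed = set()  # stand-in for durable "already done" state
        self.charges = []       # stand-in for the real side effect

    def handle(self, transaction_id, amount):
        if transaction_id in self.completed:
            return "skipped"    # duplicate delivery: do nothing
        self.charges.append((transaction_id, amount))
        self.completed.add(transaction_id)
        return "charged"


processor = PaymentProcessor()
first_delivery = processor.handle("txn-1001", 25)
second_delivery = processor.handle("txn-1001", 25)  # same message redelivered
```

The second delivery is skipped, so `processor.charges` ends up with a single entry.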
3.2. Messages can show up out of order
This possibility trips people up, since they expect a system
called a “queue” to always show first-in, first-out (FIFO)
characteristics. However, this isn’t easily possible in a large
distributed system, so messages can show up out of order once in a
while. One good way to ensure that you process messages in order is to
attach an increasing ID to every message, and reject messages that
skip ahead.
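The increasing-ID approach can be sketched as follows. The `InOrderReceiver` class is hypothetical, and it assumes the producer numbers messages 0, 1, 2, and so on. A "rejected" message would simply be left undeleted so that it reappears once its visibility timeout elapses.

```python
class InOrderReceiver:
    """Toy receiver that only processes the next expected sequence ID."""

    def __init__(self):
        self.next_expected = 0
        self.processed = []

    def offer(self, sequence_id, body):
        if sequence_id < self.next_expected:
            return "duplicate"  # already handled; safe to delete
        if sequence_id > self.next_expected:
            return "rejected"   # skips ahead; leave it on the queue for later
        self.processed.append(body)
        self.next_expected += 1
        return "processed"


receiver = InOrderReceiver()
results = [
    receiver.offer(1, "second"),  # arrives early
    receiver.offer(0, "first"),
    receiver.offer(1, "second"),  # redelivered after its timeout
    receiver.offer(0, "first"),   # stale duplicate
]
```

After these four deliveries, `receiver.processed` holds the bodies in their intended order.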
3.3. Time skew/late delivery
Time skew and late delivery are two different issues, but they are
related because they have to do with timing. When using Windows Azure
worker roles to process queue messages, you shouldn’t rely on the
clocks being in sync. In the cloud, clocks can drift by up to a minute,
and any code that relies on specific timestamps to process messages
must take this into account.
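One way to take clock drift into account is to build a tolerance into every timestamp comparison. The sketch below assumes the one-minute figure from the text as the tolerance; the function name `is_definitely_past` is made up for this example.

```python
from datetime import datetime, timedelta, timezone

MAX_CLOCK_SKEW = timedelta(minutes=1)  # drift bound mentioned in the text


def is_definitely_past(deadline, local_now, skew=MAX_CLOCK_SKEW):
    """True only if the deadline has passed even allowing for clock drift."""
    return local_now - skew > deadline


deadline = datetime(2010, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
within_skew = is_definitely_past(deadline, deadline + timedelta(seconds=30))
beyond_skew = is_definitely_past(deadline, deadline + timedelta(minutes=2))
```

A timestamp only 30 seconds past the deadline is treated as inconclusive, while one two minutes past is treated as expired.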
Another issue is late delivery. When you place a message onto
the queue, it may not show up for the receiver for some time. Your
application shouldn’t depend on the receiver instantly getting to view
the message.